Descriptive statistics are measures that summarize important features of data, often with a single number. Producing descriptive statistics is a common first step to take after cleaning and preparing a data set for analysis.
dataset <- read.delim("phages.tsv")
Mean, Median, Mode, and Range
mean(dataset$molGC....)
[1] 48.42675
# colMeans() gets the means for all columns in a data frame
# colMeans(dataset) # generates an error because not all columns have continuous data
colMeans(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500 0.406250 3.687500 2.812500
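One way around that error is to keep only the numeric columns before calling colMeans(). The sketch below uses the built-in iris data set as a stand-in, because the column types of phages.tsv are not shown here; colMeans(iris) fails for a similar reason (the Species column is a factor), and the same pattern should carry over to dataset.

```r
# Identify numeric columns, then average only those
is_num <- sapply(iris, is.numeric)   # logical vector: TRUE for numeric columns
numeric_means <- colMeans(iris[, is_num, drop = FALSE])
print(numeric_means)
```

The drop = FALSE argument keeps the result a data frame even if only one numeric column remains.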
# rowMeans() gets the means for all rows in a data frame
rowMeans(mtcars)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360 Merc 240D
29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000 24.63455
Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
27.23364 31.86000 31.78727 46.43091 46.50000 46.35000 66.23273 66.05855
Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
65.97227 19.44091 17.74227 18.81409 24.88864 47.24091 46.00773 58.75273
Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
57.37955 18.92864 24.77909 24.88027 60.97182 34.50818 63.15545 26.26273
# Show only the first 6 row means
head(rowMeans(mtcars))
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
29.90727 29.98136 23.59818 38.73955 53.66455 35.04909
median(dataset$molGC....)
[1] 46.308
colMedians <- apply(mtcars,
  MARGIN = 2,   # Operate on columns
  FUN = median  # Apply median() to each column
)
colMedians
mpg cyl disp hp drat wt qsec vs am gear carb
19.200 6.000 196.300 123.000 3.695 3.325 17.710 0.000 0.000 4.000 2.000
range(dataset$molGC....)
[1] 22.070 72.692
max(dataset$molGC....)
[1] 72.692
min(dataset$molGC....)
[1] 22.07
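Base R has no built-in function for the statistical mode (the mode() function reports an object's storage type instead). A minimal sketch of one way to compute it follows; stat_mode is a hypothetical helper name, and it is demonstrated on mtcars$cyl rather than the phage data. The example also expresses the range as a single number, the difference between the maximum and the minimum.

```r
# Hypothetical helper: return the most frequent value(s) of a vector
stat_mode <- function(x) {
  values <- unique(x)                    # distinct values, in order of appearance
  counts <- tabulate(match(x, values))   # how often each distinct value occurs
  values[counts == max(counts)]          # keep ties, if any
}

cyl_mode <- stat_mode(mtcars$cyl)
print(cyl_mode)   # 8 is the most common cylinder count in mtcars

# range() returns c(min, max); diff() turns that into the width of the range
mpg_width <- diff(range(mtcars$mpg))
print(mpg_width)
```

For continuous measurements such as GC content, most values occur only once, so the mode is usually more informative for discrete or categorical columns.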
Variance and standard deviation
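The variance measures how widely values spread around the mean: it is the average of the squared deviations (differences) from the mean. The built-in function var() computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n.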
var(dataset$molGC....)
[1] 123.0872
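To make the definition concrete, the variance can be reproduced by hand. The sketch below uses mtcars$mpg as an illustrative stand-in for the GC-content column and compares the result with var().

```r
x <- mtcars$mpg
n <- length(x)
sq_dev <- (x - mean(x))^2            # squared deviations from the mean
sample_var <- sum(sq_dev) / (n - 1)  # sample variance, as computed by var()
pop_var <- sum(sq_dev) / n           # plain average of the squared deviations
print(c(by_hand = sample_var, built_in = var(x), population = pop_var))
```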
The standard deviation is the square root of the variance. Use sd() to check the standard deviation.
sd(dataset$molGC....)
[1] 11.09447
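Because it is the square root of the variance, the standard deviation is expressed in the same units as the data, which usually makes it easier to interpret. A quick check of that relationship, again using mtcars$mpg as an illustrative stand-in:

```r
x <- mtcars$mpg
sd_from_var <- sqrt(var(x))   # square root of the sample variance
print(sd_from_var)
print(isTRUE(all.equal(sd_from_var, sd(x))))   # TRUE if both agree
```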
Quartiles and Interquartile Ranges
Quartiles divide a dataset into four equal parts. The first quartile (Q1) is the value below which 25% of the data falls, the second quartile (Q2) is the median, and the third quartile (Q3) is the value below which 75% of the data falls.
The interquartile range is the range between the first quartile (Q1) and the third quartile (Q3). It represents the spread of the middle 50% of the data.
# Compute the quartiles.
q <- quantile(dataset$molGC....) # Default probs: 0, 0.25, 0.5, 0.75, 1
q1 <- quantile(dataset$molGC...., 0.25)
q2 <- quantile(dataset$molGC...., 0.50) # Median
q3 <- quantile(dataset$molGC...., 0.75)
print(q)
0% 25% 50% 75% 100%
22.0700 39.3105 46.3080 58.5770 72.6920
print(q1)
25%
39.3105
print(q2)
50%
46.308
print(q3)
75%
58.577
# Get five number summary
fivenum(dataset$molGC....)
[1] 22.070 39.310 46.308 58.577 72.692
# summary() shows the five number summary plus the mean
summary(dataset$molGC....)
Min. 1st Qu. Median Mean 3rd Qu. Max.
22.07 39.31 46.31 48.43 58.58 72.69
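Note that fivenum() reported 39.310 for the lower quartile while quantile() reported 39.3105. The two functions follow different conventions: fivenum() returns Tukey's hinges, whereas quantile() by default interpolates between order statistics (its type = 7 method). A small made-up vector makes the difference easy to see:

```r
x <- 1:8   # toy data, not taken from the phage data set
hinge_q1 <- fivenum(x)[2]               # lower hinge: median of the lower half
quant_q1 <- unname(quantile(x, 0.25))   # default type = 7 interpolation
print(c(hinge = hinge_q1, quantile = quant_q1))   # 2.50 vs 2.75
```

On large data sets the two values are typically very close, as they are here for the GC-content column.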
The quantile() function also lets you check percentiles other than the common ones that make up the five number summary. To find them, pass a vector of probabilities (values between 0 and 1) to the probs argument.
quantile(dataset$molGC....,
  probs = c(0.1, 0.9) # Get the 10th and 90th percentiles
)
10% 90%
35.2525 64.6980
The interquartile range (IQR) is another common measure of spread. It is the distance between the 3rd quartile and the 1st quartile, which encompasses the middle half of the data. R has a built-in IQR() function.
IQR(dataset$molGC....)
[1] 19.2665
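IQR() gives the same result as subtracting the first quartile from the third, both taken from quantile(). A brief sketch using mtcars$mpg as a stand-in column:

```r
x <- mtcars$mpg
q <- quantile(x, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])   # Q3 - Q1
print(iqr_by_hand)
print(isTRUE(all.equal(iqr_by_hand, IQR(x))))   # TRUE if both agree
```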
A boxplot is a visual representation of the five number summary and the IQR.
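five_num <- fivenum(dataset$molGC....)
boxplot(dataset$molGC....)
# A single boxplot is drawn at x = 1; the data values run along the y-axis.
# adj shifts each label to the right of the box.
text(x = 1, y = five_num[1], adj = 2, labels = "Minimum")
text(x = 1, y = five_num[2], adj = 2.3, labels = "1st Quartile")
text(x = 1, y = five_num[3], adj = 3, labels = "Median")
text(x = 1, y = five_num[4], adj = 2.3, labels = "3rd Quartile")
text(x = 1, y = five_num[5], adj = 2, labels = "Maximum")
# Rotated label marking the span of the box (the IQR)
text(x = 1, y = five_num[3], adj = c(0.5, -8), labels = "IQR", srt = 90, cex = 2)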
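One caveat when labelling a boxplot this way: by default the whiskers extend only to the most extreme data points within 1.5 times the IQR of the box, and anything further out is drawn as an individual point. The labelled minimum or maximum may therefore sit on an outlier rather than at a whisker end. boxplot.stats() returns the numbers the plot actually uses; the toy vector below is made up for illustration.

```r
x <- c(1:9, 50)          # toy data with one obvious outlier
five_num <- fivenum(x)   # 1.0 3.0 5.5 8.0 50.0
bp <- boxplot.stats(x)   # default coef = 1.5
print(five_num)
print(bp$stats)          # whisker ends, hinges, and median: 1.0 3.0 5.5 8.0 9.0
print(bp$out)            # points drawn separately: 50
```

When bp$out is empty, the whisker ends coincide with the minimum and maximum from fivenum().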
De La Salle University, Manila, Philippines, daphne_janelyn_go@dlsu.edu.ph